92 research outputs found

    Combining semantic and syntactic generalization in example-based machine translation

    In this paper, we report our experiments in combining two EBMT systems that rely on generalized templates, Marclator and CMU-EBMT, on an English–German translation task. Our goal was to see whether a statistically significant improvement could be achieved over the individual performances of these two systems. We observed that this was not the case. However, our system consistently outperformed a lexical EBMT baseline system.
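    The significance check mentioned above is commonly done with paired bootstrap resampling over sentence-level scores. A minimal sketch of that generic technique (not the paper's exact procedure; the scores below are made up):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Estimate how often system A beats system B when resampling
    sentence-level scores with replacement (paired bootstrap)."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins_a += 1
    return wins_a / n_samples

# Made-up sentence-level scores (e.g., sentence BLEU) for two systems:
scores_a = [0.32, 0.41, 0.28, 0.55, 0.37]
scores_b = [0.30, 0.39, 0.29, 0.50, 0.35]
print(f"A wins in {paired_bootstrap(scores_a, scores_b):.1%} of bootstrap samples")
```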

    Target-Level Sentence Simplification as Controlled Paraphrasing

    Automatic text simplification aims to reduce the linguistic complexity of a text in order to make it easier to understand and more accessible. However, simplified texts are consumed by a diverse array of target audiences and what might be appropriately simplified for one group of readers may differ considerably for another. In this work we investigate a novel formulation of sentence simplification as paraphrasing with controlled decoding. This approach aims to alleviate the major burden of relying on large amounts of in-domain parallel training data, while at the same time allowing for modular and adaptive simplification. According to automatic metrics, our approach performs competitively against baselines that prove more difficult to adapt to the needs of different target audiences or require significant amounts of complex-simple parallel aligned data.
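    One way to read "paraphrasing with controlled decoding" is to over-generate paraphrase candidates and rerank them towards a target complexity level. A minimal sketch under that assumption, using a generic Hugging Face seq2seq paraphraser; the model name and the crude length-based complexity proxy are placeholders, not the paper's method:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "your-paraphrase-model"  # placeholder: any seq2seq paraphrase model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def complexity(sentence: str) -> float:
    """Crude complexity proxy: average word length (stand-in for a real
    readability measure tuned to the target audience)."""
    words = sentence.split()
    return sum(len(w) for w in words) / max(len(words), 1)

def simplify(sentence: str, target: float = 4.0, n_candidates: int = 8) -> str:
    """Generate several paraphrase candidates with beam search and pick
    the one whose complexity is closest to the requested target level."""
    inputs = tokenizer(sentence, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=n_candidates,
        num_return_sequences=n_candidates,
        max_new_tokens=64,
    )
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return min(candidates, key=lambda c: abs(complexity(c) - target))

print(simplify("The municipality intends to refurbish the dilapidated premises."))
```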

    20 Minuten: A Multi-task News Summarisation Dataset for German

    Automatic text summarisation (ATS) is a central task in natural language processing that aims to reduce a long document into a shorter, concise summary that conveys its key points. Extractive approaches to ATS, which identify and copy the most important sentences or phrases from the original text, have long been a popular choice, but these summaries suffer from being incohesive and disjointed. More recently, abstractive approaches to ATS have gained popularity thanks to advancements in neural text generation. Yet, much of the research on ATS has been limited to English, due to its high-resource dominance. This work introduces a new dataset for German-language news summarisation. Aside from summarisation, the dataset also allows for addressing additional NLP tasks such as image caption generation and reading time prediction. Furthermore, it is multi-purpose since article summaries cover a range of styles, including headlines, lead paragraphs and bullet-point summaries. In order to showcase the versatility of our dataset for different NLP tasks, we conduct experiments using mT5 [2] and compare the performance on six different tasks under single- and multi-task fine-tuning conditions, providing baselines for future work. Our findings show that dedicated models consistently perform better according to automatic metrics.
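    The multi-task fine-tuning setup can be approximated by prefixing each training example with its task name, T5-style, and fine-tuning one mT5 model on the mixture (mT5 itself is pretrained without such prefixes). A minimal sketch with toy data; the prefixes, hyperparameters and training loop are illustrative assumptions, not the paper's configuration:

```python
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Toy multi-task mixture: every example carries a task prefix.
examples = [
    ("summarize: <article text>", "<bullet-point summary>"),
    ("headline: <article text>", "<headline>"),
    ("caption: <article text>", "<image caption>"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for source, target in examples:
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss  # seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```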

    A Multilingual Simplified Language News Corpus

    Simplified language news articles are being offered by specialized web portals in several countries. The thousands of articles that have been published over the years are a valuable resource for natural language processing, especially for efforts towards automatic text simplification. In this paper, we present SNIML, a large multilingual corpus of news in simplified language. The corpus contains 13k simplified news articles written in one of six languages: Finnish, French, Italian, Swedish, English, and German. All articles are shared under open licenses that permit academic use. The level of text simplification varies depending on the news portal. We believe that even though SNIML is not a parallel corpus, it can be useful as a complement to the more homogeneous but often smaller corpora of news in the simplified variety of one language that are currently in use.

    Machine Translation between Spoken Languages and Signed Languages Represented in SignWriting

    This paper presents work on novel machine translation (MT) systems between spoken and signed languages, where signed languages are represented in SignWriting, a sign language writing system. Our work seeks to address the lack of out-of-the-box support for signed languages in current MT systems and is based on the SignBank dataset, which contains pairs of spoken language text and SignWriting content. We introduce novel methods to parse, factorize, decode, and evaluate SignWriting, leveraging ideas from neural factored MT. In a bilingual setup, translating from American Sign Language to (American) English, our method achieves over 30 BLEU, while in two multilingual setups, translating in both directions between spoken languages and signed languages, we achieve over 20 BLEU. We find that common MT techniques used to improve spoken language translation similarly affect the performance of sign language translation. These findings validate our use of an intermediate text representation for signed languages to include them in natural language processing research.
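    Factored MT here means decomposing each SignWriting symbol into factors (for instance base symbol, rotation and coarse x/y position) and combining their embeddings into one input vector. A minimal PyTorch sketch of such a factor-embedding layer; the factor inventory, sizes and the choice to sum the embeddings are assumptions, not the paper's exact factorization:

```python
import torch
import torch.nn as nn

class FactoredSymbolEmbedding(nn.Module):
    """Embed a SignWriting symbol as the sum of its factor embeddings:
    base symbol id, rotation, and coarse x/y position bins."""

    def __init__(self, n_symbols=700, n_rotations=16, n_positions=32, dim=256):
        super().__init__()
        self.symbol = nn.Embedding(n_symbols, dim)
        self.rotation = nn.Embedding(n_rotations, dim)
        self.pos_x = nn.Embedding(n_positions, dim)
        self.pos_y = nn.Embedding(n_positions, dim)

    def forward(self, symbol_ids, rotation_ids, x_bins, y_bins):
        return (self.symbol(symbol_ids) + self.rotation(rotation_ids)
                + self.pos_x(x_bins) + self.pos_y(y_bins))

# One toy "sign" of three symbols, each described by four factor ids.
emb = FactoredSymbolEmbedding()
symbols = torch.tensor([[12, 45, 301]])
rotations = torch.tensor([[0, 3, 8]])
x_bins = torch.tensor([[10, 12, 15]])
y_bins = torch.tensor([[5, 7, 7]])
print(emb(symbols, rotations, x_bins, y_bins).shape)  # torch.Size([1, 3, 256])
```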

    Considerations for meaningful sign language machine translation based on glosses

    Automatic sign language processing is gaining popularity in Natural Language Processing (NLP) research (Yin et al., 2021). In machine translation (MT) in particular, sign language translation based on glosses is a prominent approach. In this paper, we review recent works on neural gloss translation. We find that limitations of glosses in general and limitations of specific datasets are not discussed in a transparent manner and that there is no common standard for evaluation. To address these issues, we put forward concrete recommendations for future research on gloss translation. Our suggestions advocate awareness of the inherent limitations of gloss-based approaches, realistic datasets, stronger baselines and convincing evaluation.

    Linguistically Motivated Sign Language Segmentation

    Sign language segmentation is a crucial task in sign language processing systems. It enables downstream tasks such as sign recognition, transcription, and machine translation. In this work, we consider two kinds of segmentation: segmentation into individual signs and segmentation into phrases, larger units comprising several signs. We propose a novel approach to jointly model these two tasks. Our method is motivated by linguistic cues observed in sign language corpora. We replace the predominant IO tagging scheme with BIO tagging to account for continuous signing. Given that prosody plays a significant role in phrase boundaries, we explore the use of optical flow features. We also provide an extensive analysis of hand shapes and 3D hand normalization. We find that introducing BIO tagging is necessary to model sign boundaries. Explicitly encoding prosody by optical flow improves segmentation in shallow models, but its contribution is negligible in deeper models. Careful tuning of the decoding algorithm atop the models further improves the segmentation quality. We demonstrate that our final models generalize to out-of-domain video content in a different signed language, even under a zero-shot setting. We observe that including optical flow and 3D hand normalization enhances the robustness of the model in this context. Comment: Accepted at EMNLP 2023 (Findings).
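    The switch from IO to BIO tagging amounts to marking the first frame of every sign (or phrase) with B and the remaining frames with I, so back-to-back segments in continuous signing stay separable. A minimal sketch of that conversion; the frame indices and spans are made up and this is not the authors' code:

```python
def spans_to_bio(n_frames, spans):
    """Convert (start, end) frame spans (end exclusive) into per-frame
    BIO tags. Adjacent spans stay distinguishable because each one
    opens with its own B tag, unlike plain IO tagging."""
    tags = ["O"] * n_frames
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

# Two back-to-back signs followed by a pause:
print(spans_to_bio(10, [(1, 4), (4, 7)]))
# ['O', 'B', 'I', 'I', 'B', 'I', 'I', 'O', 'O', 'O']
```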
    • …